Little Knowledge Rules the Web: Domain-Centric Result Page Extraction
Authors
Abstract
Web extraction is the task of turning unstructured HTML into structured data. Previous approaches rely exclusively on detecting repeated structures in result pages. These approaches trade intensive user interaction for precision. In this paper, we introduce the Amber (“Adaptable Model-based Extraction of Result Pages”) system that replaces the human interaction with a domain ontology applicable to all sites of a domain. It models domain knowledge about (1) records and attributes of the domain, (2) low-level (textual) representations of these concepts, and (3) constraints linking representations to records and attributes. Parametrized with these constraints, otherwise domain-independent heuristics exploit the repeated structure of result pages to derive attributes and records. Amber is implemented in logical rules to allow an explicit formulation of the heuristics and easy adaptation to different domains. We apply Amber to the UK real estate domain, where we achieve near-perfect accuracy on a representative sample of 50 agency websites.
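The three kinds of domain knowledge named in the abstract can be illustrated with a toy sketch. This is not Amber's implementation (which the abstract says is written as logical rules); the regular expressions, the `MANDATORY` constraint, the path strings, and the function names below are all hypothetical, chosen only to show how attribute annotations plus a constraint might be combined with a repeated-structure heuristic.

```python
import re
from collections import Counter

# Hypothetical textual representations of two attribute types
# (stand-ins for illustration, not Amber's actual annotators).
ANNOTATORS = {
    "price": re.compile(r"£\s?\d[\d,]*"),
    "bedrooms": re.compile(r"\b\d+\s+bed(room)?s?\b", re.IGNORECASE),
}

# Hypothetical domain constraint: a record must contain these attributes.
MANDATORY = {"price"}


def annotate(text):
    """Return {attribute: first matched string} for one candidate segment."""
    found = {}
    for name, pattern in ANNOTATORS.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(0)
    return found


def extract_records(nodes):
    """nodes: list of (structural_path, text) pairs for candidate segments.

    Keep segments that satisfy the mandatory-attribute constraint, then keep
    only those sharing the most frequent structural path, which serves here
    as a crude proxy for the page's repeated record structure.
    """
    candidates = [(path, annotate(text)) for path, text in nodes]
    valid = [(path, attrs) for path, attrs in candidates
             if MANDATORY.issubset(attrs)]
    if not valid:
        return []
    best_path, _ = Counter(path for path, _ in valid).most_common(1)[0]
    return [attrs for path, attrs in valid if path == best_path]


# Made-up page segments for demonstration only.
page = [
    ("body/div/ul/li", "3 bedroom house, Oxford, £250,000"),
    ("body/div/ul/li", "2 bed flat, Cowley, £180,000"),
    ("body/div/p", "Mortgage calculator: borrow up to £500,000"),
    ("body/div/ul/li", "Studio to let, price on application"),
]
records = extract_records(page)
```

In this sketch the mortgage paragraph carries a price annotation but is discarded because its structural path is not the repeated one, and the last list item is discarded because it violates the mandatory-price constraint. Adapting the sketch to another domain would mean swapping the annotators and the constraint while leaving `extract_records` unchanged, which mirrors the separation the abstract describes.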
Similar resources
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
S-CREAM: Semiautomatic CREAtion of Metadata
Richly interlinked, machine-understandable data constitute the basis for the Semantic Web. We provide a framework, S-CREAM, that allows for creation of metadata and is trainable for a specific domain. Annotating web documents is one of the major techniques for creating metadata on the web. The implementation of S-CREAM, OntoMat, now supports the semi-automatic annotation of web pages. This semi-a...
Grammatical inference for information extraction and visualisation on the Web
The world-wide web contains a wealth of database-style information scattered across different sites that could be better used if it were integrated into a single view. Since document formats vary widely between sites and frequently mingle structural with presentation markup, extracting and integrating data from web pages is a difficult challenge. Manually writing extraction wrappers is expensiv...
Knowledge Extraction by Using an Ontology Based Annotation Tool
This paper describes a Semantic Annotation Tool for extraction of knowledge structures from web pages through the use of simple user-defined knowledge extraction patterns. The semantic annotation tool contains: an ontology-based mark-up component which allows the user to browse and to mark-up relevant pieces of information; a learning component (Crystal from the University of Massachusetts at A...
Building Intelligent Systems for Mining Information Extraction Rules from Web Pages by Using Domain Knowledge
Previous research on automatic information extraction experienced difficulties in acquiring and representing useful domain knowledge and in coping with the structural heterogeneity among different information sources. As a result, many real-world information sources with complex document structures could not be correctly analyzed. In order to resolve these problems, this paper presents a meth...